QTM 447 Lecture 5: Loss Minimization and Optimization

Kevin McAlister

January 28, 2025

The Generic Problem

\[ \newcommand\hbb{{\hat{\boldsymbol \beta}}} \newcommand\bb{{\boldsymbol \beta}} \newcommand\expn{{\frac{1}{N} \sum \limits_{i = 1}^N}} \newcommand\sumk{\sum \limits_{k = 1}^K} \newcommand\argminb{\underset{\bb}{\text{argmin }}} \newcommand\argmaxb{\underset{\bb}{\text{argmax }}} \newcommand\gtheta{\mathbf g(\boldsymbol \theta)} \newcommand\htheta{\mathbf H(\boldsymbol \theta)} \]

For models with parameters, our goal is to find a solution that minimizes the empirical risk.

We can express most problems with parameters, generally, in the following form:

\[\hbb = \underset{\bb}{\text{argmin }} \expn L(y_i , f(\bb , \mathbf x_i)) + \lambda C(\bb)\]

where \(C(\cdot)\) is a function that measures the complexity of the model and increases as the model becomes more complex.

  • We can always set \(\lambda = 0\) to return the unregularized method!

The Generic Problem

Given a loss specification, the goal of machine learning methods fitting via the ERM framework is to find the parameters that minimize the loss

  • Under certain conditions, this empirical loss converges to the generalization error - our true goal!

Almost all of our commonly used supervised methods fall into this framework. Exceptions:

  • KNN (there are no parameters)

  • Tree-based approaches (the fitting is done via binary recursive splitting methods)

  • Maybe some that I’m forgetting? Can y’all think of anything?

Regularized Least Squares

For problems with a continuous outcome, the loss function is often mean squared error:

\[\hbb = \argminb \expn (y_i - \phi(\mathbf x_i)^T \bb - \alpha)^2 + \lambda C(\bb)\]

  • OLS
  • Ridge/LASSO
  • Splines/GAMs
  • Kernel Regression

Regularized Logistic Regression

For problems with 2 or more discrete outcomes, the loss function is often cross-entropy:

\[\hbb = \argminb \left[ - \expn \sumk I(y_i = k) \log \theta_{ik} + \lambda \sumk C(\bb_k) \right]\]

\[\theta_{ik} = \frac{\exp[\phi(\mathbf x_i)^T \boldsymbol \beta_k]}{\sum \limits_{h = 1}^K \exp[\phi(\mathbf x_i)^T \boldsymbol \beta_h]}\]

  • All the logistic variants of the linear methods

Support Vector Machines

Surprisingly, even SVMs fit into this framework!

Suppose that we have two classes labelled -1 and 1. We can define the SVM loss function as:

\[\hbb = \argminb \expn \varphi\left(y_i(\phi(\mathbf x_i)^T \bb - \alpha)\right) + \lambda \|\bb\|^2\]

where \(\varphi(a) = \text{max}(0,1 - a)\) is the hinge loss

  • We don’t frequently fit SVMs this way because the hinge loss is not twice differentiable everywhere

  • Fit most frequently using LIBSVM - a very fast constrained optimization software designed to solve the corresponding quadratic programming problem.
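As a sketch (toy data, with \(\phi\) taken as the identity; all names here are illustrative), the hinge-loss objective can be evaluated directly:

```python
import numpy as np

def svm_loss(beta, alpha, X, y, lam):
    """Average hinge loss plus an L2 penalty; labels y must be in {-1, +1}."""
    margins = y * (X @ beta - alpha)            # y_i (x_i^T beta - alpha)
    hinge = np.maximum(0.0, 1.0 - margins)      # varphi(a) = max(0, 1 - a)
    return hinge.mean() + lam * beta @ beta

# Two separable toy points: a unit coefficient classifies both with margin 2
X = np.array([[2.0], [-2.0]])
y = np.array([1.0, -1.0])
print(svm_loss(np.array([1.0]), 0.0, X, y, lam=0.0))
```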

Minimization

All of these methods require finding the coefficients that minimize the loss function!

The question then becomes: how can we find the minimum?

Minimization

A quick side note:

We sometimes encounter maximization problems. We’ll always switch this to a corresponding minimization problem.

Example: Maximum likelihood estimation

Given a log likelihood, \(\log f(\bb | \mathcal D)\), we want to find:

\[\hbb = \argmaxb \log f(\bb | \mathcal D)\]

Without loss of generality:

\[\hbb = \argminb - \log f(\bb | \mathcal D)\]

Minimization

Notation Alert: For my own sanity, I’m going to omit the feature extractor function \(\phi(\mathbf x_i)\) from the rest of this lecture. We can always transform the features before we pass them to these different learning methods!

Minimization

Sometimes, we get lucky and can solve for the minimum by hand.

Example: Least Squares w/ L2 regularization

\[\hbb = \argminb \expn (y_i - \mathbf x_i^T \bb - \alpha)^2 + \frac{\lambda}{N} \sum \limits_{j = 1}^P \beta_j^2\]

\[\hbb = (\mathbf X^T \mathbf X + \lambda \mathcal I_P)^{-1} \mathbf X^T \mathbf y\]

  • Note that we’re dividing by \(N\) to keep everything as an average over observations.

  • This will pay off later.
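As a quick sketch on simulated data (variable names are illustrative), we can verify that the closed-form solution above zeroes the gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
N, P, lam = 100, 3, 0.1
X = rng.normal(size=(N, P))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=N)

# Closed-form ridge solution: (X^T X + lambda I)^{-1} X^T y
beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(P), X.T @ y)

# At the minimizer, the gradient -2 X^T (y - X b) + 2 lam b should vanish
grad = -2 * X.T @ (y - X @ beta_hat) + 2 * lam * beta_hat
print(np.abs(grad).max())
```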

Minimization

We know that this corresponds to a point in the parameter space where the gradient for the coefficients is equal to zero.

How do we know that this is a minimum?

How do we know that this is a global minimum?

Minimization

Notation:

Let \(\mathcal L(\boldsymbol \theta)\) be our loss function where \(\boldsymbol \theta\) is the set of parameters we would like to learn from the data.

Let \(\nabla \mathcal L(\boldsymbol \theta) = \mathbf g(\boldsymbol \theta)\) be the gradient vector for the loss function w.r.t. the unknowns:

\[\gtheta = \begin{bmatrix} \frac{\partial \mathcal L(\boldsymbol \theta)}{\partial \theta_1} \\ \frac{\partial \mathcal L(\boldsymbol \theta)}{\partial \theta_2} \\ \frac{\partial \mathcal L(\boldsymbol \theta)}{\partial \theta_3} \\ \vdots \end{bmatrix} \]

Minimization

Jacobians:

A matrix of partial derivatives with the output components on the rows and the unknowns on the columns.

In other words, each row of the Jacobian is a gradient. For scalar-valued outputs (like a loss function):

\[ J(\boldsymbol \theta) = \gtheta^T \]

Minimization

Notation:

Let \(\nabla^2 \mathcal L(\boldsymbol \theta) = \htheta\) be the Hessian matrix:

\[\htheta = \begin{bmatrix} \frac{\partial^2 \mathcal L(\boldsymbol \theta)}{\partial \theta_1\partial \theta_1} & \frac{\partial^2 \mathcal L(\boldsymbol \theta)}{\partial \theta_1\partial \theta_2} & \frac{\partial^2 \mathcal L(\boldsymbol \theta)}{\partial \theta_1\partial \theta_3} & \ldots \\ \frac{\partial^2 \mathcal L(\boldsymbol \theta)}{\partial \theta_2\partial \theta_1} & \frac{\partial^2 \mathcal L(\boldsymbol \theta)}{\partial \theta_2\partial \theta_2} & \frac{\partial^2 \mathcal L(\boldsymbol \theta)}{\partial \theta_2\partial \theta_3} & \ldots \\ \frac{\partial^2 \mathcal L(\boldsymbol \theta)}{\partial \theta_3\partial \theta_1} & \frac{\partial^2 \mathcal L(\boldsymbol \theta)}{\partial \theta_3\partial \theta_2} & \frac{\partial^2 \mathcal L(\boldsymbol \theta)}{\partial \theta_3\partial \theta_3} & \ldots \\ \vdots & \vdots & \vdots & \ddots \end{bmatrix} \]

Can also define the Hessian as:

\[ \frac{\partial J(\boldsymbol \theta)}{\partial \boldsymbol \theta} \]

Minimization

For a function, a point, \(\boldsymbol \theta^*\) is a local minimum (the lowest point in a neighborhood of the parameter space) if:

  • \(\mathbf g(\boldsymbol \theta^*) = \mathbf 0\)

  • \(\mathbf H(\boldsymbol \theta^*)\) is positive definite. For a \(P \times P\) Hessian matrix, we can define positive definiteness as:

\[\mathbf b^T \mathbf H(\boldsymbol \theta^*) \mathbf b > 0 \text{ for all } \mathbf b \in \mathbb R^P, \mathbf b \neq \mathbf 0\]

Minimization

Over the set of all points, \(\boldsymbol \theta^*\), that are local minima, the global minimum is the one that achieves the lowest value w.r.t. the loss function.

  • It is always possible that a loss function has multiple sets of unknowns that lead to local minima.

  • However, we’re not going to worry about that too much

Most common ML algorithms are constructed in such a way that they admit a single minimum

  • The loss functions are strictly convex

Minimization

Suppose our loss function is twice differentiable over its domain.

The loss function is strictly convex if \(\htheta\) is positive definite over its entire domain!

  • Strictly convex functions are guaranteed to have a single local minimum

  • Therefore, it must be the global minimum

Minimization

Let’s go back to L2 regularized linear regression:

\[\mathcal L(\boldsymbol \beta, \alpha) = \expn (y_i - \mathbf x_i^T \boldsymbol \beta - \alpha)^2 + \frac{\lambda}{N} \sum \limits_{j = 1}^P \beta_j^2\]

and argue that the minimum:

\[\hbb = (\mathbf X^T \mathbf X + \lambda I_P)^{-1} \mathbf X^T \mathbf y\]

is a global minimum.

  • For brevity, we’re going to assume that the outcome is centered so that \(\alpha = 0\).

Minimization

Let’s start by finding the gradient for the unknown parameters - \(\bb\)

\[\mathbf g(\bb) = - \expn 2(y_i - \mathbf x_i^T \bb)\mathbf x_i + \frac{2 \lambda}{N} \bb\]

which can be more compactly written as:

\[\mathbf g(\bb) = -\frac{2}{N} \mathbf X^T (\mathbf y - \mathbf X \bb ) + \frac{2 \lambda}{N} \bb\]

Minimization

Now, find the Hessian:

\[\frac{\partial^2 \mathcal L(\boldsymbol \beta, \alpha)}{\partial \bb \partial \bb^T} = \frac{2}{N} \sum \limits_{i = 1}^N \mathbf x_i \mathbf x_i^T + \frac{2}{N} \lambda \mathcal I_P\]

which can be more compactly written as:

\[\mathbf H(\bb) = \frac{2}{N}\left[ \mathbf X^T \mathbf X + \lambda \mathcal I_P \right]\]

Minimization

To show that this matrix is positive definite, we only need to look at the definition for positive definiteness:

\[\frac{2}{N} \mathbf b^T \left[ \mathbf X^T \mathbf X + \lambda \mathcal I_P \right] \mathbf b > 0 \text{ for all } \mathbf b \in \mathbb R^P, \mathbf b \neq \mathbf 0\]

Multiplying everything through:

\[\frac{2}{N}\left[(\mathbf X \mathbf b)^T (\mathbf X \mathbf b) + \lambda \mathbf b^T \mathbf b\right]\]

which is necessarily positive for any \(\mathbf b \neq \mathbf 0\) when \(\lambda > 0\).
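A quick numeric check of this claim on a hypothetical random design matrix (values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
N, P, lam = 50, 4, 0.5
X = rng.normal(size=(N, P))

# Ridge Hessian: (2/N)[X^T X + lambda I] -- positive definite iff all
# eigenvalues are strictly positive
H = (2.0 / N) * (X.T @ X + lam * np.eye(P))
eigvals = np.linalg.eigvalsh(H)
print(eigvals.min() > 0)
```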

Minimization

Let’s demonstrate this using a sample data set.

Minimization

Neat!

We have a strictly convex loss function and a unique global minimizer.

Now, let’s move on to the more novel case.

Logistic Regression

For now, let’s concentrate on binary logistic regression with no penalty.

\[\hbb = \argminb -\expn \left[ y_i \log \theta_{i} + (1 - y_i) \log (1 - \theta_{i}) \right]\]

\[\theta_{i} = \frac{\exp[z_i]}{1 + \exp[z_i]}\]

\[ z_i = \mathbf x_i^T \boldsymbol \beta \]

Goal: Find \(\boldsymbol \beta\) that minimizes the negative log likelihood.

Logistic Regression

Let’s work through this one together on the board.

Logistic Regression

The derivation of the gradient for logistic regression isn’t too difficult

  • Outlined in the corresponding lecture notes

The gradient:

\[\mathbf g(\bb) = \expn \left[\sigma(\mathbf x_i^T \boldsymbol \beta) - y_i\right]\mathbf x_i\]
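We can sanity-check this gradient against finite differences on toy data (a sketch; names and values are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_grad(beta, X, y):
    """Gradient of the average NLL: (1/N) sum [sigma(x_i^T b) - y_i] x_i."""
    return X.T @ (sigmoid(X @ beta) - y) / len(y)

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 2))
y = (rng.random(20) < 0.5).astype(float)
beta = np.array([0.3, -0.7])

def loss(b):
    p = sigmoid(X @ b)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Central finite differences along each coordinate
eps = 1e-6
fd = np.array([(loss(beta + eps * e) - loss(beta - eps * e)) / (2 * eps)
               for e in np.eye(2)])
print(np.allclose(fd, logistic_grad(beta, X, y), atol=1e-6))
```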

Logistic Regression

Unfortunately, this gradient doesn’t admit an analytical solution for the values at the critical point!

  • We can’t set it to zero and solve for the coefficients.

We’ll need to use computational minimization methods

Logistic Regression

Before we do this, though, it is useful to check if there exists a unique solution for the coefficients.

With no penalty, we can simplify this problem to just checking that the Hessian w.r.t. \(\bb\) is always positive definite (since we can always just add a column of 1s).

Logistic Regression

Take a few minutes to try to derive the Hessian for the logistic regression coefficients. Also try to put it in matrix form.

The Jacobian (rewritten in a form that is a little easier to work with):

\[\mathbf J(\bb) = \mathbf g(\bb)^T = \expn \left[\sigma(\mathbf x_i^T \boldsymbol \beta) \mathbf x_i^T - y_i \mathbf x_i^T \right]\]

Some helpful hints:

  • \(\frac{d}{d\boldsymbol \beta} \sigma(\mathbf x_i^T \boldsymbol \beta) = \left[\sigma(\mathbf x_i^T \boldsymbol \beta)(1 - \sigma(\mathbf x_i^T \boldsymbol \beta))\right]\mathbf x_i\)

  • \(\frac{d}{d \boldsymbol \beta} \mathbf x_i^T \boldsymbol \beta = \mathbf x_i\)

  • Order matters! \(\frac{d}{dx} f(g(x)) = \frac{d}{du} f(u) \frac{d}{dx} g(x)\)

  • The final matrix should be \(P \times P\)

Logistic Regression

This Hessian can be rewritten in matrix form as:

\[\mathbf X^T \mathbf S \mathbf X\]

where \(\mathbf S\) is an \(N \times N\) diagonal matrix with \(\sigma(\mathbf x_i^T \boldsymbol \beta)\left[ 1 - \sigma(\mathbf x_i^T \boldsymbol \beta) \right]\) along the diagonal.

Since \(\mathbf S\) is a diagonal matrix with only positive elements, we can show that:

\[\mathbf b^T \mathbf X^T \mathbf S \mathbf X \mathbf b = (\mathbf S^{1/2} \mathbf X \mathbf b)^T (\mathbf S^{1/2} \mathbf X \mathbf b) = \|\mathbf S^{1/2} \mathbf X \mathbf b\|^2\]

which must be nonnegative, and strictly positive for all \(\mathbf b \neq \mathbf 0\) whenever \(\mathbf X\) has full column rank.
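As a sketch with simulated data (illustrative values), we can build \(\mathbf X^T \mathbf S \mathbf X\) and confirm its eigenvalues are positive:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 3))
beta = np.array([0.5, -1.0, 0.2])

# Diagonal of S: sigma(x_i^T beta) * (1 - sigma(x_i^T beta)), all positive
s = sigmoid(X @ beta) * (1 - sigmoid(X @ beta))
H = X.T @ (s[:, None] * X) / len(s)     # (1/N) X^T S X without forming S

print(np.linalg.eigvalsh(H).min() > 0)
```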

Gradient Descent

Given the gradient and Hessian, how can we find \(\hbb\)?

Let’s start by assuming that we have a “guess” for the parameter values, \(\bb_0\).

  • More than likely \(\bb_0 \neq \hbb\)

We would like to update \(\bb_0\) to \(\bb_1\) such that:

\(\mathcal L(\bb_1) < \mathcal L(\bb_0)\)

Gradient Descent

Let \(\mathbf d\) be a descent direction and \(\eta > 0\) a step size such that:

\[\mathcal L(\bb_0 + \eta \mathbf d) < \mathcal L(\bb_0)\]

  • \(\mathbf d\) is a vector that defines a direction in \(\mathbb R^P\)

  • We’ll just assume that \(\mathbf d\) is unit normed - \(\|\mathbf d \| = 1\). The norm of \(\mathbf d\) doesn’t really matter.

  • \(\eta\) should be sufficiently small. For now, just assume it’s \(\epsilon\) away from zero.

Gradient Descent

Given the gradient for the loss function evaluated at our current guess, \(\mathbf g(\bb_0)\), we can define the directional derivative in a direction \(\mathbf d\) as:

\[\mathbf d^T \mathbf g(\bb_0)\]

  • How much does \(\mathcal L(\bb)\) change if we move \(\epsilon\) away from \(\bb_0\) along \(\mathbf d\)?

The law of cosines says that an inner product of two vectors can be defined as:

\[\|\mathbf d\| \|\mathbf g(\bb_0)\| \cos(\theta)\]

where \(\theta\) is the angle between \(\mathbf d\) and \(\mathbf g(\bb_0)\).

Gradient Descent

To be most efficient - decrease the loss the most in a single step - we want to move in the direction of steepest descent

\[\mathbf d^* = \underset{\mathbf d}{\text{argmin }} \|\mathbf d\| \|\mathbf g(\bb_0)\| \cos(\theta)\]

  • We know that \(\|\mathbf d\|\) is 1.

  • \(\|\mathbf g(\bb_0)\|\) doesn’t depend on \(\mathbf d\)

Gradient Descent

\(\cos(\theta)\) ranges between -1 and 1.

For what \(\theta\) does \(\cos(\theta) = -1\)?

Therefore, we want to choose \(\mathbf d\) such that it moves exactly away from the gradient vector!

This means that the direction of steepest descent is the negative gradient!

Gradient Descent

We are guaranteed that if we move \(\epsilon\) away from \(\bb_0\) in the direction of the negative gradient, that we will decrease the loss with the new guess:

\[\bb_1 = \bb_0 - \epsilon \mathbf g(\bb_0)\]

If we do the same thing at \(\bb_1\)

\[\bb_2 = \bb_1 - \epsilon \mathbf g(\bb_1)\]

we are guaranteed to decrease the loss function even more!

Repeating this a lot of times, we’ll eventually get to a point where:

\[\mathbf g(\bb_t) = \mathbf 0\]

and the negative gradient is zero as well, so the updates stop moving. We’re stuck!

Gradient Descent

Fortunately, this is where we want to land! A local minimum.

  • Global if we can show that the loss function is strictly convex.

The gradient descent algorithm:

Set initial value, b
t = 0
g = gradient of loss at b
while norm(g) > tol:
  b = b - (eps*g)
  g = gradient of loss at b
  t += 1

print(b) 

Gradient Descent

Guarantee: with enough iterations and sufficiently small step size, we are guaranteed to find the minimum with this algorithm!

However, the number of iterations needed to get there is inversely related to the step size

  • And we’re only guaranteed to land on the exact minimum as the step size shrinks toward zero, since the descent guarantee is a derivative-based (infinitesimal) argument

In practice, we’d be dust in the wind by the time it finished if we set the step size to the machine minimum

Gradient Descent

Instead, choose a “small” step size and stop the algorithm when the change in loss is sufficiently small

  • Record the loss at each candidate value of the coefficients

  • If the difference between the loss at the next step and the current one is less than a small value (I frequently set mine at \(10^{-8}\)), call it close enough!

How small is small?
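A minimal sketch of this stopping rule for the binary logistic loss (toy simulated data; the `eta` and `tol` values are illustrative, not prescriptive):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(beta, X, y):
    p = sigmoid(X @ beta)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def gd_logistic(X, y, eta=0.5, tol=1e-8, max_iter=100_000):
    """Gradient descent, stopping when the loss improves by less than tol."""
    beta = np.zeros(X.shape[1])
    prev = logistic_loss(beta, X, y)
    for _ in range(max_iter):
        grad = X.T @ (sigmoid(X @ beta) - y) / len(y)
        beta -= eta * grad
        cur = logistic_loss(beta, X, y)
        if prev - cur < tol:        # "close enough" stopping rule
            break
        prev = cur
    return beta

# Toy data generated from a known coefficient vector
rng = np.random.default_rng(4)
X = rng.normal(size=(500, 2))
true_beta = np.array([1.0, -1.0])
y = (rng.random(500) < sigmoid(X @ true_beta)).astype(float)
beta_hat = gd_logistic(X, y)
print(beta_hat)
```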

Logistic Regression

Let’s look at this by examining a logistic regression fit.


Gradient Descent

Gradient descent works!

  • Mostly

Issues arise when the step size is either too small or too large

  • In this case, it needed a really large step size since the loss function has a pretty “smooth” gradient

  • Also, it’s low dimensional. So, it doesn’t have that many directions to try.

Gradient Descent

Other problem: computational cost!

For gradient descent, the total computational complexity depends on the step size and the size of training features.

For step size/number of iterations to convergence:

\[\mathcal O(\log 1/\eta)\]

Within each gradient evaluation:

\[\mathcal O(NP)\]

Gradient Descent

This will not be trivial for modern examples!

  • Large \(P\)

  • Large \(N\)

This adds up!

  • My version of gradient descent for logistic regression takes roughly 1 ms per iteration

  • Assuming everything scales linearly, a 2D logistic regression problem with 1 million obs would take roughly 5.5 hours to complete in the worst case scenario with 10,000 iterations to convergence!

Stochastic Gradient Descent

Is there a time-saving shortcut we can take?

Let’s go back to the initial generalization problem. We want to find a solution that minimizes the generalization error:

\[E_\mathcal T[\mathcal L(y , \hat{y})]\]

We want to approximate this expectation with the empirical risk over the training data:

\[\expn \mathcal L(y_i , \hat{y}_i)\]

Stochastic Gradient Descent

Over the training data, we want to evaluate the gradient of the loss at a given value of the unknowns. The derivative of a sum is a sum of the derivatives, so:

\[\frac{1}{N} \mathbf g(\bb; \mathbf X , \mathbf y) = \expn \mathbf g(\bb ; \mathbf x_i , y_i)\]

For problems that attempt to minimize the empirical risk of i.i.d. draws, the gradient is the average of the gradient evaluations at each training point!

Stochastic Gradient Descent

A thought: rather than evaluating the gradient at all \(N\) training points, sample a minibatch of size \(B \ll N\) from the training data, \(\mathcal B\). Approximate the gradient evaluated at \(\bb\) with:

\[\frac{1}{B} \sum \limits_{j \in \mathcal B} \mathbf g(\bb ; \mathbf x_j , y_j)\]

It turns out that this is a really good approximation for

\[\frac{1}{N} \mathbf g(\bb; \mathbf X , \mathbf y)\]

Any thoughts as to why?
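A quick simulation suggesting why (hypothetical data; minibatch gradients average out to the full-sample gradient):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(5)
N, B = 10_000, 100
X = rng.normal(size=(N, 2))
y = (rng.random(N) < sigmoid(X @ np.array([1.0, -1.0]))).astype(float)
beta = np.array([0.2, 0.3])

full_grad = X.T @ (sigmoid(X @ beta) - y) / N

# Average many minibatch gradients: each one is unbiased for the full gradient
estimates = []
for _ in range(200):
    idx = rng.choice(N, size=B, replace=False)
    estimates.append(X[idx].T @ (sigmoid(X[idx] @ beta) - y[idx]) / B)
mean_est = np.mean(estimates, axis=0)
print(np.abs(mean_est - full_grad).max())
```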

Stochastic Gradient Descent

A sampled variant of gradient descent:

Set initial value, b
t = 0
Choose batch size, B
while stopping rule not met:
  Sample B obs from the N training points
  g = minibatch gradient at b
  b = b - (eps*g)
  t += 1

print(b) 

Stochastic Gradient Descent

When \(B = 1\), this method is called stochastic gradient descent

When \(1 < B < N\), this method is called minibatch gradient descent

A natural question:

Does this actually work? Can we just get a noisy approximation of the gradient at each step and eventually land where we want to land?

Stochastic Gradient Descent

The proof that SGD converges to a minimum is a little more involved than the one for GD.

Some key points:

  • Due to noise in the training sample, any one update is not guaranteed to lower the loss

  • However, as long as the samples are truly i.i.d., each average (even over just one observation) is unbiased for the full sample gradient

  • In expectation we’re moving in the correct direction each time

Stochastic Gradient Descent

However, we’re not guaranteed to converge to the point corresponding to the minimum!

  • The stochastic nature of the sampling means that we’ll converge to a noise ball around the minimum whose size is proportional to the step size and inversely proportional to the proportion of the training data that we use in our batches.

Stochastic Gradient Descent

The biggest benefit is decreased computation time. For SGD, the computational complexity is \(\mathcal O(P/\eta)\)

  • We lose the log decrease w.r.t. the step size in exchange for a linear one

  • But, we also lose the increase due to \(N\)!

Other benefits:

  • Won’t waste time recomputing the gradient for training instances that are largely the same - \(\mathbf x_i \approx \mathbf x_j\)

  • Our initial guess is probably pretty bad. Quickly compute some estimate of the gradient and get out of there in, hopefully, a negative direction.

Logistic Regression

Let’s look at this by examining a logistic regression fit.


Stochastic Gradient Descent

With \(B = 1\), it just stops…

  • The algorithm never gets where it’s supposed to go.

Stopping rule:

  • If the values change by less than \(\epsilon\), terminate

  • Not a good idea for SGD. Why?

Alteration 1:

Just run the algorithm for a fixed number of iterations, rather than stopping when the loss “plateaus”


Stochastic Gradient Descent

That works! But heavily dependent on step size

  • When \(\eta\) is too small, we really don’t get there

  • When \(\eta\) is too big, we have a large noise ball

Why does the noise ball occur?

Stochastic Gradient Descent

An aside:

How do we choose the batches? What constitutes an iteration?

Convergence properties hold when the batches are sampled randomly

  • Intuition?

From an engineering perspective, extremely inefficient for large data sets

  • Load in random sets on demand

  • Load in all data and access when needed

Both are prohibitively sloooooowwwwww when \(N\) is large!

  • Also, may never touch an important data point

Stochastic Gradient Descent

Shuffle/pass method:

  • At the beginning of an epoch, shuffle the data set and split it into exclusive batches of size \(B\)

  • Each batch is used, in sequence, to compute a gradient step.

  • After touching all data, end epoch and reshuffle

  • Repeat
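A minimal sketch of the shuffle/pass batching (the helper `minibatches` is hypothetical, not from any particular library):

```python
import numpy as np

def minibatches(N, B, rng):
    """Shuffle indices once per epoch, then yield exclusive batches of size B."""
    idx = rng.permutation(N)
    for start in range(0, N, B):
        yield idx[start:start + B]

rng = np.random.default_rng(6)
batches = list(minibatches(N=10, B=3, rng=rng))
print(batches)  # 4 exclusive batches; every index touched exactly once
```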

Stochastic Gradient Descent

More efficient because the method can pre-cache the next batch while gradient computation is occurring

  • The full list of instructions for an epoch are already given before gradient computation starts

This is the standard algorithm for PyTorch/Tensorflow

Stochastic Gradient Descent

Two approaches to make the noise ball less problematic.

Approach #1: Learning Rate Schedules

Rather than using a fixed step size, start with a large step size and temper the learning rate as the algorithm proceeds.

Most common scheduler: Polynomial Decay

\(\eta_t = \eta_0(b t + 1)^{-a}\)

  • Common parameter values: \(a = .5\) and \(b = 1\)
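A sketch of this scheduler with the common parameter values above:

```python
def poly_decay(t, eta0=1.0, a=0.5, b=1.0):
    """Polynomial decay schedule: eta_t = eta0 * (b*t + 1)^(-a)."""
    return eta0 * (b * t + 1) ** (-a)

# Large early steps that temper as t grows
print([round(poly_decay(t), 3) for t in (0, 1, 3, 99)])  # [1.0, 0.707, 0.5, 0.1]
```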

Stochastic Gradient Descent

This schedule starts with large steps then quickly decays to smaller steps.

This make sense:

  • Our initial guess is probably pretty bad. Move away from it in a hopefully decreasing direction quickly.

  • Once we move onto a path that is decent, start moving a little slower.

  • At the end, we’re hopefully around the minimum and we’ll tiptoe towards the actual minimum

Stochastic Gradient Descent

With polynomial scheduling, we get to the minimum very quickly with just 1 example per iteration!

  • Fast for logistic regression

  • Will make computation time much smaller for more complicated methods.

Stochastic Gradient Descent

Approach #2: Iterate Averaging

Run through the SGD algorithm as usual computing

\[\bb_{t + 1} = \bb_{t} - \eta \mathbf g(\bb_t)\]

However, denote \(\bar{\bb}\) as the running average of the coefficients at each step

\[\bar{\bb}_{t} = \frac{1}{t} \sum \limits_{s = 1}^{t} \bb_s = \frac{1}{t} \bb_t + \frac{t - 1}{t} \bar{\bb}_{t - 1}\]

Use \(\bar{\bb}_t\) instead of the final \(\bb_t\) as the estimate of the loss minimizer!
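A quick check that the running-average recursion above reproduces the plain average of the iterates (random values stand in for the SGD iterates):

```python
import numpy as np

rng = np.random.default_rng(7)
iterates = rng.normal(size=(1000, 2))   # stand-in for the SGD iterates b_t

# Running-average recursion: bar_b_t = (1/t) b_t + ((t-1)/t) bar_b_{t-1}
bar = np.zeros(2)
for t, b in enumerate(iterates, start=1):
    bar = b / t + (t - 1) / t * bar

print(np.allclose(bar, iterates.mean(axis=0)))
```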

Stochastic Gradient Descent

Iterate averaging can improve on SGD by forcing the actual estimate to be an average over the giant noise ball.

  • Most steps are around the minimum, so average over all of the random movements to get something a little better

No learning schedule required

This procedure actually comes with strong guarantees relative to vanilla SGD implementations.

  • The bias of the estimated gradient decays geometrically

  • The variance is optimal (e.g. achieves the smallest variance that can be achieved at any iteration)

Stochastic Gradient Descent

All in all, SGD is really quick!

  • It scales nicely to larger training sets

Note that this method will work given any gradient for the unknowns as long as the objective can be expressed as an average.

  • No matter how complicated.

However, vanilla SGD is still pretty inefficient

  • It wanders around in directions that are not moving directly towards the minimum

  • Wasted time…

Stochastic Gradient Descent

Next class, we’ll talk about modern improvements to the SGD algorithm that will help it move more efficiently

  • Less jumpiness

Momentum, adaptive gradients, and Adam.